A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Most speech and language technologies are trained with massive amounts of
speech and text information. However, most of the world's languages do not have
such resources or a stable orthography. Systems constructed under these almost
zero-resource conditions are promising not only for speech technology but also
for computational language documentation. The goal of computational language
documentation is to help field linguists to (semi-)automatically analyze and
annotate audio recordings of endangered and unwritten languages. Example tasks
are automatic phoneme discovery or lexicon discovery from the speech signal.
This paper presents a speech corpus collected during a realistic language
documentation process. It is made up of 5k speech utterances in Mboshi (Bantu
C25) aligned to French text translations. Speech transcriptions are also made
available: they correspond to a non-standard graphemic form close to the
language phonology. We present how the data was collected, cleaned and
processed and we illustrate its use through a zero-resource task: spoken term
discovery. The dataset is made available to the community for reproducible
computational language documentation experiments and their evaluation.
Comment: accepted to LREC 201